Abstract

We are developing a cross-media information retrieval 
system, in which users can view specific segments of 
lecture videos by submitting text queries. To produce 
a text index, the audio track is extracted from a lec- 
ture video and a transcription is generated by automatic 
speech recognition. In this paper, to improve the qual- 
ity of our retrieval system, we extensively investigate 
the effects of adapting acoustic and language models on 
speech recognition. We apply an MLLR-based method
to adapt the acoustic model. To obtain a corpus for
language model adaptation, we use the textbook for a target
lecture to search a Web collection for the pages associ- 
ated with the lecture topic. We show the effectiveness of 
our method by means of experiments. 

1. Introduction 

Given the growing volume of multimedia content
available via the World Wide Web and DVDs, retrieving
specific information relevant to user needs has become
crucial. Because text is one of the most common and
effective ways to express user needs, we proposed a
cross-media information retrieval (CMIR) system [1], in
which a user can submit text queries to search an entire
lecture video program for the relevant segments.

Because oral presentations are usually organized 
based on text materials, such as textbooks, a user first 
selects text segments (e.g., keywords, phrases, sentences, 
and paragraphs) in a textbook related to a target lecture. 
Then, a text query is generated automatically from one or 
more selected segments. That is, queries can be formu- 
lated even if the user cannot provide effective keywords. 
The user can also submit additional keywords as queries, 
if necessary. Finally, video segments are retrieved and 
presented to the user. 

To retrieve video passages in response to text queries, 
we extract the audio track from a lecture video, generate 
a transcription by means of automatic speech recognition 
(ASR), and produce a text index, prior to system use. 

In our previous work [1], we proposed a method to
adapt a language model for ASR to the topic of a target
lecture. We showed experimentally that our method
improved the accuracy of ASR and also significantly
improved the accuracy of CMIR.

However, in the previous experiment the vocabulary
size was limited to 20K, although larger vocabulary
sizes, such as 100K, have been used in recent research.
In addition, the contribution of acoustic model
adaptation to ASR was not investigated. It may be argued
that the effects of adapting models are overshadowed by
the effect of increasing the vocabulary size. To answer
this question, in this paper we extensively investigate
the effects of adapting the language and acoustic models
on ASR.

2. Cross-media Retrieval System 
2.1. Overview 

Figure 1 depicts the overall design of our CMIR system,
in which the left and right regions correspond to the
on-line and off-line processes, respectively. While our
system is currently implemented for Japanese, our
methodology is fundamentally language independent. For
the purpose of research and development, we tentatively
target lecture programs on TV for which textbooks are
published. We explain the basis of our system using
Figure 1.

In the off-line process, given the video data of a target 
lecture, audio data are extracted and divided into a num- 
ber of segments. Then, a speech recognition system tran- 
scribes each segment. Finally, the transcribed segments 
are indexed as in conventional text retrieval systems, so 
that each segment can be retrieved efficiently in response 
to text queries. 

For speech recognition, we use two adaptation meth- 
ods. To adapt speech recognition to a specific lecturer, we 
perform unsupervised speaker adaptation using an initial 
speech recognition result (i.e., a transcription). To adapt 
speech recognition to a specific topic, we perform lan- 
guage model adaptation, for which we search a large cor- 
pus for the documents associated with the textbook for 
a target lecture. Then, the retrieved documents (i.e., a 
topic-specific corpus) are used to produce a word-based 
N-gram language model. 

We also perform image analysis to extract text (e.g.,
keywords and phrases) from flip charts. These contents
are also used to improve our language model. However,
this method is beyond the scope of this paper.

In the on-line process, a user can view specific video 
segments by submitting any text queries, i.e., keywords, 
phrases, sentences, and paragraphs, extracted from the 
textbook. Any queries not in the textbook can also be 
used. The current implementation is based on a client- 
server system on the Web. Both the off-line and on-line 
processes are performed on servers, but users can access 
our system using Web browsers on their own PCs. 

It should be noted that unlike conventional
keyword-based retrieval systems, in which users usually
submit a small number of keywords, in our system users
can easily submit longer queries relying on textbooks.
If submitted keywords are misrecognized in
transcriptions, the retrieval accuracy decreases.
However, long queries are robust against speech
recognition errors, because the effect of misrecognized
words is overshadowed by the large number of words
correctly recognized.

Because our focus in this paper is to investigate the 
accuracy of speech recognition, in the following sections 
we elaborate only on speech recognition and document 
retrieval used to obtain topic-specific corpora. 
Figure 1: Overview of our CMIR system.



2.2. Speech Recognition 

The speech recognition module generates word sequence
W, given phone sequence X. In a stochastic framework,
the task is to select the W maximizing P(W|X), which
is transformed as in Equation (1) through Bayes'
theorem.



$$\arg\max_{W} P(W|X) = \arg\max_{W} P(X|W) \cdot P(W) \qquad (1)$$



P(X|W) models the probability that the word sequence
W is transformed into the phone sequence X, and P(W)
models the probability that W is linguistically
acceptable. These factors are called the acoustic and
language models, respectively.
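As a minimal illustration of Equation (1), the sketch below picks the best word sequence from a small set of candidates by adding the logarithms of the two factors. The candidate list and all scores are invented for illustration, and the function name `best_hypothesis` is hypothetical; an actual decoder searches a far larger hypothesis space and computes the scores from its models.

```python
import math


def best_hypothesis(candidates):
    """Return the word sequence W maximizing log P(X|W) + log P(W).

    `candidates` maps each candidate word sequence (a tuple of words) to a
    pair (acoustic_logprob, language_logprob). Both scores are assumed
    inputs here; a real system obtains them from its acoustic and
    language models.
    """
    return max(candidates, key=lambda w: sum(candidates[w]))


# Toy scores, invented purely for illustration.
candidates = {
    ("recognize", "speech"): (math.log(0.30), math.log(0.020)),
    ("wreck", "a", "nice", "beach"): (math.log(0.35), math.log(0.001)),
}
print(best_hypothesis(candidates))
```

In this toy case the second candidate has the higher acoustic score, but the language model makes the first one the overall winner, which is the effect that language model adaptation aims to strengthen for topic-specific wording.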

We use the Japanese dictation toolkit (footnote 1),
which includes the Julius decoder and acoustic/language
models. Julius performs a two-pass (forward-backward)
search using word-based forward bigrams and backward
trigrams. The acoustic model was produced from the ASJ
speech database, which contains approximately 20,000
sentences uttered by 132 speakers of both genders.
A 16-mixture Gaussian distribution triphone Hidden
Markov Model, in which states are clustered into 2,000
groups by a state-tying method, is used. We adapt the
provided acoustic model by means of an MLLR-based
unsupervised speaker adaptation method, for which we
use the HTK toolkit (footnote 2).
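For reference, MLLR in its common mean-adaptation form maps every Gaussian mean of the acoustic model through a shared affine transform. The formula below is a sketch of that standard formulation and not a description of the exact configuration used in our experiments; the symbols A, c, and the mixture index m are notation introduced only for this illustration.

```latex
\hat{\mu}_m = A\,\mu_m + c
```

Here \mu_m is the original mean of Gaussian m and \hat{\mu}_m is the adapted mean. A and c are estimated so as to maximize the likelihood of the adaptation speech; in the unsupervised setting, the initial recognition result serves as the transcription of that speech.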

Existing methods to adapt language models can be
classified into two categories. In the first category
(the integration approach), general and topic-specific
corpora are integrated to produce a language model
[2, 3]. Because the sizes of those corpora differ, N-gram
statistics are calculated using the weighted average of
the statistics extracted independently from those
corpora. However, it is difficult to determine the
optimal weight depending on the topic. In the second
category (the selection approach), a topic-specific
subset is selected from a general corpus and is used to
produce a language model. This approach is effective
when a general corpus contains documents associated with
a target topic but the N-gram statistics of those
documents would be overshadowed by the other documents
in a language model built from the whole corpus.
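The integration approach can be sketched as follows. This is only an illustration under stated assumptions: it interpolates relative N-gram frequencies linearly, which is one plausible reading of "weighted average" and not necessarily the formulation used in [2, 3]. The function names and the `topic_weight` parameter are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the word N-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def relative_frequencies(counts):
    """Normalize raw counts so that corpora of different sizes are comparable."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def integrate(general_counts, topic_counts, topic_weight):
    """Weighted average of the N-gram statistics of two corpora.

    topic_weight (between 0 and 1) is the assumed mixing weight given to
    the topic-specific corpus.
    """
    general = relative_frequencies(general_counts)
    topic = relative_frequencies(topic_counts)
    grams = set(general) | set(topic)
    return {
        g: (1.0 - topic_weight) * general.get(g, 0.0)
        + topic_weight * topic.get(g, 0.0)
        for g in grams
    }


# Toy corpora, invented for illustration.
general = ngram_counts("a b a b".split(), 2)
topic = ngram_counts("a b c".split(), 2)
mixed = integrate(general, topic, topic_weight=0.5)
```

The result depends directly on `topic_weight`, which is why the choice of weight is the weak point of this approach.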

We followed the selection approach, because the 10M
Web page corpus [4] containing mainly Japanese pages
associated with various topics was publicly available. In
practice, we performed preprocessing to discard
extraneous pages, such as pages including only word lists
and script codes. The resultant corpus consists of 7M
pages.

The quality of the selection approach depends on the
method of selecting topic-specific subsets. An existing
method [5] uses hypotheses in the initial speech
recognition phase as a query to retrieve topic-specific
documents from a general corpus. However, errors in the
initial hypotheses have the potential to decrease the
retrieval accuracy. Thus, we use the textbook related to
a target lecture as a query to improve the retrieval
accuracy and consequently the quality of the language
model adaptation.

2.3. Document Retrieval 

We use an existing probabilistic text retrieval method
[6] to compute the relevance score between the query,
which is the textbook for a target lecture, and each
document in the Web corpus. The relevance score for
document d is computed by Equation (2).

$$\sum_{t} f_{t,q} \cdot \frac{(K+1) \cdot f_{t,d}}{K \cdot \left\{ (1-b) + b \cdot \frac{dl_d}{avgdl} \right\} + f_{t,d}} \cdot \log \frac{N - n_t + 0.5}{n_t + 0.5} \qquad (2)$$

Here, f_{t,q} and f_{t,d} denote the frequency with which
term t appears in query q and document d, respectively.
N and n_t denote the total number of documents in the Web
corpus and the number of documents containing term t,
respectively. dl_d denotes the length of document d, and
avgdl denotes the average length of documents in the
Web corpus. We empirically set K = 2.0 and b = 0.8.
We use content words, such as nouns, extracted from
transcribed documents as index terms, and perform
word-based indexing. We use the ChaSen morphological
analyzer (footnote 3) to extract content words. The same
method is used to extract terms from queries.

1 http://winnie.kuis.kyoto-u.ac.jp/dictation/
2 http://htk.eng.cam.ac.uk/
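The relevance score of Equation (2) can be sketched in Python as below. The function and argument names are hypothetical, and the sketch assumes that document length is measured in index terms, which the text does not specify. Terms are assumed to be already extracted (for Japanese, by a morphological analyzer).

```python
import math
from collections import Counter


def relevance_score(query_terms, doc_terms, doc_freq, num_docs, avgdl,
                    K=2.0, b=0.8):
    """Score one document against a query, following Equation (2).

    query_terms, doc_terms: lists of index terms.
    doc_freq: dict mapping a term t to n_t, the number of documents
              containing t.
    num_docs: N, the total number of documents in the corpus.
    avgdl: average document length (assumed here to be in terms).
    """
    f_q = Counter(query_terms)
    f_d = Counter(doc_terms)
    dl = len(doc_terms)  # assumption: length = number of index terms
    score = 0.0
    for term, f_tq in f_q.items():
        f_td = f_d.get(term, 0)
        if f_td == 0:
            continue  # a term absent from the document contributes nothing
        n_t = doc_freq[term]
        tf_part = (K + 1) * f_td / (K * ((1 - b) + b * dl / avgdl) + f_td)
        idf_part = math.log((num_docs - n_t + 0.5) / (n_t + 0.5))
        score += f_tq * tf_part * idf_part
    return score


# Toy corpus of three documents, invented for illustration.
docs = [["law", "crime", "law"], ["greek", "history"],
        ["solar", "system", "planet"]]
doc_freq = Counter(t for d in docs for t in set(d))
avgdl = sum(len(d) for d in docs) / len(docs)
scores = [relevance_score(["law"], d, doc_freq, len(docs), avgdl)
          for d in docs]
```

Ranking the documents by this score and keeping the top-ranked ones yields the topic-specific subset. Because the whole textbook is the query, f_{t,q} gives more influence to terms that the textbook uses often.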

3. Experimentation 
3.1. Methodology 

To evaluate the effects of adaptation methods on speech
recognition, we reused the test collection from our
previous work [1]. Five lecture programs on TV, for which
printed textbooks were also published, were videotaped
in DV and were used as target lectures. Each lecture was
transcribed manually and the sentence boundaries with
temporal information were also identified manually.

Table 1 shows details of the five lectures. Each
lecture was 45 minutes long. We shall use the term "word
token" to refer to occurrences of words, and the term
"word type" to refer to vocabulary items. The column
"#Fillers", which gives the number of interjections in
speech, partially reflects the fluency of each lecturer.

To evaluate the accuracy of speech recognition, we 
used the word error rate (WER), which is the ratio of the 
number of word errors (deletion, insertion, and substi- 
tution) to the total number of words. We also used test- 
set out-of-vocabulary rate (OOV) and trigram test-set per- 
plexity (PP) to evaluate the extent to which our language 
model adapted to the target topics. 

We used human transcriptions as test set data. For 
example, OOV is the ratio of the number of word tokens 
not contained in the language model for speech recogni- 
tion to the total number of word tokens in the transcrip- 
tion. Note that smaller values of OOV, PP, and WER are 
obtained with better methods. 
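The two simpler measures can be sketched as follows. The sketch assumes the usual convention that WER is normalized by the number of words in the human transcription and that word errors are counted with a minimum edit distance alignment; the function names are hypothetical. Both functions return fractions, whereas the tables report percentages.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference).

    The error count is the word-level edit distance between the human
    transcription (reference) and the ASR output (hypothesis).
    """
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all remaining reference words
    for j in range(cols):
        dist[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(reference)


def oov_rate(reference, vocabulary):
    """Fraction of reference word tokens missing from the vocabulary."""
    return sum(1 for w in reference if w not in vocabulary) / len(reference)


# Toy example, invented for illustration.
reference = "the solar system has eight planets".split()
hypothesis = "a solar system has eight big planets".split()
wer = word_error_rate(reference, hypothesis)
oov = oov_rate(reference, {"the", "solar", "system", "has"})
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, because the denominator is the reference length.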

To adapt language models, we used the textbook for a
target lecture and searched the 7M Web page corpus (see
Section 2.2) for the N relevant pages, which were used as
a source corpus. We set N = 5000, with which the best
performance was obtained in a preliminary experiment.
In the case where the language model adaptation was
not performed, all 7M pages were used as a source
corpus. We used three different vocabulary sizes, i.e.,
20K, 60K, and 100K. In either case, high frequency words
in a source corpus were used to produce a word-based
trigram language model. We used the ChaSen morphological
analyzer to extract words from the source corpora,
because Japanese sentences lack lexical segmentation.

3 http://chasen.aist-nara.ac.jp/
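Selecting the vocabulary from the high frequency words of a source corpus can be sketched as below. The function name `select_vocabulary` is hypothetical, pages are assumed to be already segmented into words, and the treatment of ties at the cut-off is left to `Counter.most_common`, which the text does not specify.

```python
from collections import Counter


def select_vocabulary(tokenized_pages, max_size):
    """Keep at most `max_size` of the most frequent words in the corpus."""
    counts = Counter(word for page in tokenized_pages for word in page)
    return {word for word, _ in counts.most_common(max_size)}


# Toy corpus, invented for illustration.
pages = [["a", "b", "a"], ["a", "c", "b"]]
small_vocab = select_vocabulary(pages, max_size=2)
large_vocab = select_vocabulary(pages, max_size=10)
```

When the source corpus contains fewer word types than `max_size`, as in the second call, the resulting vocabulary is simply the full set of word types in the corpus.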

For lecture #2 we did not perform acoustic model 
adaptation, because the speech data contained constant 
background noise and the sound quality was not good 
enough to adapt the acoustic model. 

3.2. Results 

Table 2 shows the values of OOV, PP, and WER for
different methods. The columns "Base", "+AM", "+LM",
and "+AL" denote the results obtained by no adaptation,
acoustic model adaptation, language model adaptation,
and acoustic/language model adaptation, respectively.

Note that the values of OOV and PP do not change 
whether or not the acoustic model adaptation was per- 
formed. It should also be noted that because for lecture #2 
the acoustic model adaptation was not performed, the val- 
ues of WER for +AM and +AL are the same as those for 
Base and +LM, respectively. 

The findings that can be derived from Table 2 are as
follows. First, the values of OOV and PP were decreased
by language model adaptation, except for lectures #2 and
#3 with 20K and 60K vocabulary sizes.

Second, the values of WER were generally decreased by
adapting the acoustic and language models independently.
The improvement obtained by each model was almost
the same. However, when the two were used together the
improvement was even greater, bringing the average
value of WER below 40%. This result is encouraging
because in the TREC spoken document retrieval track [7],
the decrease in retrieval accuracy was small if the
value of WER was 30-40%.

Third, the values of WER decreased by adapting 
acoustic and language models even for the 60K and 100K 
vocabulary sizes, although in our previous study the vo- 
cabulary size was limited to 20K. 

However, the values of WER themselves did not
change significantly depending on the vocabulary size.
One reason is that the average number of word types (i.e.,
the actual vocabulary size) in the 5,000 pages used to
adapt a language model was approximately 54K, which
was less than the permissible number for the 60K and
100K settings. In contrast, in the cases where we used
all 7M pages, the vocabulary size was always the same as
the permissible number.

Finally, by comparing the values of WER for 20K vo- 
cabulary size obtained with the adaptation (+AM, +LM, 
and +AL) and those for 60K and 100K vocabulary sizes 
obtained without the adaptation (Base), one can see that 
adapting the language model and/or the acoustic model 
was more effective than increasing the vocabulary size. 
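The comparison can be made concrete with the average WER values in Table 2. The sketch below computes the relative WER reduction, a summary measure introduced here only for illustration; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Relative reduction in percent when a value drops from before to after."""
    return 100.0 * (before - after) / before


# Average WER values (%) taken from the "Avg." rows of Table 2.
base_20k = 59.93   # 20K vocabulary, no adaptation
al_20k = 39.46     # 20K vocabulary, acoustic and language model adaptation
base_100k = 57.47  # 100K vocabulary, no adaptation

print(round(relative_reduction(base_20k, al_20k), 1))
print(round(relative_reduction(base_20k, base_100k), 1))
```

With a 20K vocabulary, adapting both models reduces the average WER by roughly a third in relative terms, whereas enlarging the vocabulary from 20K to 100K without adaptation reduces it by only about 4%.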



Table 1: Details of the five lectures used for experiments. 



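| Lecture ID   | #1           | #2            | #3                 | #4            | #5           |
|--------------|--------------|---------------|--------------------|---------------|--------------|
| Topic        | Criminal law | Greek history | Domestic relations | Food and body | Solar system |
| #Word tokens | 6800         | 8040          | 7453               | 8101          | 8235         |
| #Word types  | 1035         | 1223          | 1026               | 905           | 929          |
| #Sentences   | 181          | 191           | 231                | 310           | 340          |
| #Fillers     | 3            | 953           | 818                | 708           | 1134         |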



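Table 2: Experimental results for speech recognition (OOV: test-set out-of-vocabulary rate (%), PP: trigram test-set
perplexity, WER: word error rate (%)).

| Lecture | Measure | 20K Base | 20K +AM | 20K +LM | 20K +AL | 60K Base | 60K +AM | 60K +LM | 60K +AL | 100K Base | 100K +AM | 100K +LM | 100K +AL |
|---------|---------|----------|---------|---------|---------|----------|---------|---------|---------|-----------|----------|----------|----------|
| #1      | OOV     | 4.83     | 4.83    | 1.19    | 1.19    | 2.06     | 2.06    | 0.54    | 0.54    | 1.56      | 1.56     | 0.54     | 0.54     |
| #1      | PP      | 53.62    | 53.62   | 42.16   | 42.16   | 58.73    | 58.73   | 45.99   | 45.99   | 59.00     | 59.00    | 45.99    | 45.99    |
| #1      | WER     | 31.53    | 21.76   | 21.15   | 12.97   | 28.16    | 18.38   | 21.24   | 12.84   | 28.00     | 17.47    | 21.24    | 12.84    |
| #2      | OOV     | 9.97     | 9.97    | 7.03    | 7.03    | 3.38     | 3.38    | 1.15    | 1.15    | 2.81      | 2.81     | 0.91     | 0.91     |
| #2      | PP      | 116.97   | 116.97  | 105.70  | 105.70  | 203.11   | 203.11  | 236.91  | 236.91  | 209.80    | 209.80   | 240.11   | 240.11   |
| #2      | WER     | 53.22    | 53.22   | 43.88   | 43.88   | 51.55    | 51.55   | 43.91   | 43.91   | 51.00     | 51.00    | 43.88    | 43.88    |
| #3      | OOV     | 5.44     | 5.44    | 3.86    | 3.86    | 2.55     | 2.55    | 1.56    | 1.56    | 2.19      | 2.19     | 1.56     | 1.56     |
| #3      | PP      | 177.73   | 177.73  | 156.42  | 156.42  | 209.87   | 209.87  | 202.37  | 202.37  | 214.21    | 214.21   | 202.37   | 202.37   |
| #3      | WER     | 69.66    | 59.93   | 61.26   | 53.29   | 69.36    | 59.34   | 61.86   | 53.53   | 69.27     | 58.80    | 61.86    | 53.53    |
| #4      | OOV     | 6.43     | 6.43    | 3.12    | 3.12    | 2.49     | 2.49    | 2.67    | 2.67    | 2.04      | 2.04     | 2.67     | 2.67     |
| #4      | PP      | 126.69   | 126.69  | 117.76  | 117.76  | 165.08   | 165.08  | 123.28  | 123.28  | 169.89    | 169.89   | 123.28   | 123.28   |
| #4      | WER     | 62.30    | 52.50   | 48.86   | 39.32   | 59.56    | 48.48   | 48.90   | 39.86   | 58.07     | 47.72    | 48.90    | 39.86    |
| #5      | OOV     | 7.33     | 7.33    | 4.16    | 4.16    | 2.14     | 2.14    | 0.47    | 0.47    | 1.77      | 1.77     | 0.47     | 0.47     |
| #5      | PP      | 186.89   | 186.89  | 125.36  | 125.36  | 266.45   | 266.45  | 191.94  | 191.94  | 272.74    | 272.74   | 191.94   | 191.94   |
| #5      | WER     | 78.83    | 64.98   | 58.26   | 48.97   | 77.62    | 63.90   | 58.77   | 48.77   | 76.87     | 63.58    | 58.77    | 48.77    |
| Avg.    | OOV     | 7.42     | 7.42    | 4.16    | 4.16    | 2.61     | 2.61    | 1.34    | 1.34    | 2.14      | 2.14     | 1.29     | 1.29     |
| Avg.    | PP      | 132.38   | 132.38  | 109.48  | 109.48  | 180.65   | 180.65  | 160.10  | 160.10  | 185.13    | 185.13   | 160.74   | 160.74   |
| Avg.    | WER     | 59.93    | 50.83   | 47.34   | 39.46   | 58.10    | 48.58   | 47.59   | 39.57   | 57.47     | 47.96    | 47.58    | 39.57    |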



4. Conclusion 

In this paper, to improve the quality of our cross-media
information retrieval system for lecture videos, we
evaluated the effects of adapting acoustic and language
models on speech recognition. We applied an MLLR-based
method to adapt the acoustic model. To obtain a corpus
for language model adaptation, we used the textbook for a
target lecture to search a Web collection for the pages
associated with the lecture topic. The experimental
results for five lectures showed that adapting the
acoustic model and adapting the language model each
improved the speech recognition accuracy independently,
and that the improvement was even greater when both
were used together.